Hossur'Tech Participation in Interactive INFILE

Authors

John Anton Chrisostom Ronald

Aurélie Rossi

Christian Fluhr

Abstract

Tasks performed: interactive INFILE, French to French and French to English.

Main objectives of experiments: Hossur’Tech started from scratch in mid-January to build an information extraction system based on deep linguistic analysis, so the INFILE runs came too early for us to use our linguistic tools. Our objective in performing the runs was to experiment with comparison methods on real data to help us design our future system.

Approach used: Topics were processed using a limited version of XFST with our own resources, which gave part-of-speech tagging and lemmatization. For the documents, the same linguistic processing was not possible because of the volume limitation of our version of XFST, so a simple dictionary look-up without disambiguation was used. We were only able to process French and English in time; Arabic needed a little more time. For each topic, the title, description, and narrative were used. The example document was used only as a first positive feedback and was not strictly included in the topic. For documents, only the title and text were used. For all document words, monolingual equivalents (for the French to French comparison) or translations (for the French to English comparison) were inferred. A word intersection was computed and then a concept intersection was established: all words inferred from the same word were considered to represent the same concept. Each concept in the topic-document intersection receives a weight that depends on a statistic computed on a similar corpus (the CLEF corpus) and on whether the concept appears in the topic keyword list or title. Proper nouns also receive an increased weight. A tentative threshold between relevant and irrelevant documents was computed between the weight of the example document and the maximum weight of documents relevant to other topics.

Adaptation: The threshold was adjusted according to the simulated feedback. Each word that occurs in at least 2 relevant documents is added to the topic word set. We requested 4 feedbacks for each topic, which is too few compared with real use of such systems.

Resources employed: our own dictionaries.

Results obtained: A large number of non-relevant documents were returned, because the feedback did not allow the threshold to be adjusted. Not having considered that a document could belong to several topics also produced a large number of irrelevant documents. The low level of feedback for each topic (4) was not enough to add words from relevant documents to the topics.

ACM categories and subject descriptors: H.3.3 Information Search and Retrieval, Information filtering

Free keywords: adaptive filtering, cross-lingual filtering, natural language processing
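The comparison and adaptation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the `lexicon` mapping from a document word to its inferred equivalents or translations, the `corpus_weight` table standing in for the CLEF-corpus statistic, and the boost factors are all hypothetical. The abstract only says that the tentative threshold lies "between" two weights, so the midpoint used here is an assumption as well.

```python
from collections import Counter


def concept_intersection(topic_words, doc_words, lexicon):
    """Map each document word to the topic words it reaches.

    All words inferred from one document word count as one concept,
    so the document word itself is used as the concept key.
    `lexicon` (hypothetical) maps a word to its equivalents/translations.
    """
    concepts = {}
    for word in doc_words:
        inferred = {word} | lexicon.get(word, set())
        matched = inferred & topic_words
        if matched:
            concepts[word] = matched
    return concepts


def score(concepts, corpus_weight, priority_words, proper_nouns,
          priority_boost=2.0, proper_boost=1.5):
    """Sum concept weights; the boost values are illustrative assumptions."""
    total = 0.0
    for word, matched in concepts.items():
        weight = corpus_weight.get(word, 1.0)  # corpus statistic, default 1.0
        if matched & priority_words:           # concept is in title/keywords
            weight *= priority_boost
        if word in proper_nouns:               # proper nouns weigh more
            weight *= proper_boost
        total += weight
    return total


def tentative_threshold(example_score, max_other_topic_score):
    """Assumed midpoint between the two weights named in the abstract."""
    return (example_score + max_other_topic_score) / 2.0


def expand_topic(topic_words, relevant_docs, min_docs=2):
    """Add every word found in at least `min_docs` relevant documents."""
    counts = Counter(w for doc in relevant_docs for w in set(doc))
    return topic_words | {w for w, n in counts.items() if n >= min_docs}
```

A document would be kept for a topic when its `score` exceeds the current threshold. With only 4 feedbacks per topic, `expand_topic` rarely sees a word in two relevant documents, which is consistent with the limitation reported in the results.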

Similar resources

SINAI at INFILE 2009: Experiments with Google News

This paper describes the SINAI team participation in the INFILE routing and filtering track of the CLEF campaign. This is the first participation of the SINAI research group in the INFILE task. We have participated in the batch filtering subtask and submitted two experiments: one using the topics’ text as learning data to train a classifier, and another one where training data has been construc...


UAIC: Participation in INFILE@CLEF Task

This year marked UAIC’s first participation in the INFILE@CLEF competition. This campaign’s purpose is the evaluation of cross-language adaptive filtering systems, which is to successfully build an automated system that separates relevant from non-relevant documents written in different languages in an incoming stream of textual information with respect to a given profile. A brief descriptio...


Batch Document Filtering Using Nearest Neighbor Algorithm

This paper describes the participation of the LIG lab in the batch filtering task for the INFILE (INformation FILtering Evaluation) campaign of CLEF 2009. As opposed to the online task, where the server provides the documents one by one, all of the documents are provided beforehand in the batch task, which explains the fact that feedback is not possible in the batch task. We propose in this paper ...


Overview of CLEF 2009 INFILE track

The INFILE@CLEF 2009 track is the second run of this track on the evaluation of cross-language adaptive filtering systems. It uses the same corpus as the 2008 track, composed of 300,000 newswires from Agence France Presse (AFP) in three languages: Arabic, English and French, and a set of 50 topics in general and specific domain (scientific and technological information). We proposed this year t...
